Learning to summarize from human feedback
https://arxiv.org/abs/2009.01325
Dataset: openai/summarize_from_feedback
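A minimal sketch of loading the comparison data with the Hugging Face `datasets` library; the field names (`info`, `summaries`, `choice`) are assumptions based on the dataset card, not verified against the release.

```python
# Sketch: load the human-preference comparisons from the Hub.
# Field names ("info", "summaries", "choice") are assumed, not confirmed.
from datasets import load_dataset

ds = load_dataset("openai/summarize_from_feedback", "comparisons", split="train")
example = ds[0]
post = example["info"]["post"]                 # the Reddit post being summarized
summary_a = example["summaries"][0]["text"]    # candidate summary A
summary_b = example["summaries"][1]["text"]    # candidate summary B
preferred = example["choice"]                  # index of the summary the labeler preferred
```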
In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences.
So the idea is to evaluate summaries with a reward model instead of existing metrics like ROUGE? (still on my to-read pile)
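A rough sketch of the pairwise preference loss the paper uses to train that reward model on human comparisons; `reward_model` here is a hypothetical module that maps a (post, summary) pair to a scalar score, not code from the paper.

```python
# Sketch: pairwise preference loss for reward model training.
# The model should score the human-preferred summary above the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, post, preferred_summary, rejected_summary):
    r_pref = reward_model(post, preferred_summary)  # scalar reward for the chosen summary
    r_rej = reward_model(post, rejected_summary)    # scalar reward for the other summary
    # -log sigmoid(r_pref - r_rej): minimized when the preferred summary scores higher
    return -F.logsigmoid(r_pref - r_rej).mean()
```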
Figure 5: optimizing too hard against the reward model eventually produces worse summaries (reward hacking)
Mitigated with a KL penalty that keeps the RL policy close to the supervised policy (sketch below)
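A minimal sketch of the KL-penalized reward used during RL fine-tuning, R(x, y) = r(x, y) − β·log(π_RL(y|x)/π_SFT(y|x)); the tensor shapes and the β value are illustrative assumptions.

```python
# Sketch: KL-penalized reward for RL fine-tuning.
# Penalizes the policy for drifting away from the supervised (SFT) policy,
# which discourages reward hacking against the learned reward model.
import torch

def kl_penalized_reward(rm_score, logprob_rl, logprob_sft, beta=0.05):
    # rm_score: reward model score for the sampled summary (one scalar per sequence)
    # logprob_rl / logprob_sft: per-token log-probs of the summary under the
    # RL policy and the frozen supervised policy (beta is an assumed value)
    kl = (logprob_rl - logprob_sft).sum(dim=-1)  # sequence-level log-ratio term
    return rm_score - beta * kl
```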
Lineage: this work follows on from Fine-Tuning Language Models from Human Preferences.